A data-driven analysis of how much coffee people drink and which lifestyle conditions correlate with it.
Author: Vera Jackson
Affiliation: School of Information, University of Arizona
Abstract
Add project abstract here.
Introduction
Coffee is one of the most popular drinks in the world. In the USA alone, 75% of the population has reported drinking coffee, and almost half of Americans drink it daily (Loftfield et al. 2016). People drink coffee for many reasons: for some it is a habit or part of their morning routine, some need the caffeine, and some just like the way it tastes. Studies have shown that demographic factors, such as gender and race, influence how much coffee someone drinks (Loftfield et al. 2016). However, one’s lifestyle and environment may also play a part.
The “Great American Coffee Taste Test” data set comes from TidyTuesday’s 2024 series. It compiles survey results submitted by participants of a taste test hosted by “world champion barista” James Hoffmann and the coffee company Cometeer. Cometeer sent 4 unlabelled coffees to over 4,000 customers, who took part in a live taste test on YouTube while filling out the survey. The survey includes questions about coffee drinking habits, coffee preferences, individual taste test results for each of the 4 provided coffees, and individual demographics.
Question: What are the correlations between coffee consumption and lifestyle?
Introduction
The data set covers a wide range of questions and contains 4,042 responses. Rather than looking at demographics such as gender and race, I was curious about how someone’s lifestyle, primarily their working habits, relates to how much coffee they drink. For this question, we will only look at the following variables, listed with their survey question and answer options:
cups: “How many cups of coffee do you typically drink per day?”
"Less than 1", "1", "2", "3", "4", "More than 4"
employment_status: “Employment Status”
"Retired", "Employed full-time", "Employed part-time", "Homemaker", "Student", "Unemployed"
number_children: “Number of Children”
"None", "1", "2", "3", "More than 3"
wfh: “Do you work from home or in person?”
"I primarily work in person", "I primarily work from home", "I do a mix of both"
age: “What is your age?”
"<18 years old", "18-24 years old", "25-34 years old", "35-44 years old", "45-54 years old", "55-64 years old", ">65 years old"
After removing respondents who did not answer every one of these questions, we are working with 3,343 responses.
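The filtering step can be illustrated on a few made-up rows. This is only a minimal sketch: `toy_survey` and its values are hypothetical, and `tidyr::drop_na()` is used here as a compact stand-in for checking each column for missing values one at a time.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy responses; the real survey has many more columns and rows.
toy_survey <- tibble(
  cups = c("1", "2", NA, "3"),
  employment_status = c("Student", "Retired", "Retired", NA),
  number_children = c("None", "2", "1", "None"),
  wfh = c("I do a mix of both", "I primarily work from home",
          "I primarily work in person", "I primarily work in person"),
  age = c("18-24 years old", ">65 years old", "55-64 years old", "25-34 years old")
)

# Keep only respondents who answered every one of the selected questions.
toy_complete <- toy_survey %>%
  select(cups, employment_status, number_children, wfh, age) %>%
  drop_na()

nrow(toy_complete)
```

In this toy example, two of the four rows have a missing answer, so two complete responses remain.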
I chose to analyze this question for this data set because I was curious about what may influence coffee drinkers’ consumption habits, in particular their working conditions. Whether someone is retired or a student, or works from home or at the office, may shape their environment and influence their habits. In addition to employment and work-from-home status, I included the number of children, which is especially relevant for homemakers, since raising children is also a form of labor. Finally, I included age, as this could be another explanatory variable for how much coffee someone drinks.
Overall, my goal for this question is to highlight a trend between coffee drinking and lifestyle that could be reflective of the general American coffee-drinking population.
Approach
Using the cups variable, a pie chart was made to show the distribution of responses across the entire cleaned data set. The pie chart is color-coded by the number of cups of coffee one drinks per day and labelled with the percentage of responses in each category.
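The percentage labels for such a chart come from a simple count-and-share calculation. The sketch below uses a hypothetical `toy_cups` table rather than the survey data and assumes the `scales` package is installed for formatting.

```r
library(dplyr)

# Hypothetical answers to the cups question.
toy_cups <- tibble(
  cups = c("1", "2", "2", "Less than 1", "2", "3", "1", "More than 4")
)

# Order of the answer options as they appear in the survey.
cup_levels <- c("Less than 1", "1", "2", "3", "4", "More than 4")

cup_share <- toy_cups %>%
  mutate(cups = factor(cups, levels = cup_levels)) %>%
  count(cups, name = "total_cups") %>%          # responses per answer
  mutate(
    share = total_cups / sum(total_cups),       # fraction of all responses
    label = scales::percent(share, accuracy = 0.1)
  )

cup_share
```

Converting `cups` to a factor first keeps the rows in survey order instead of alphabetical order.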
Once the overall distribution was observed, a point plot with error bars ranging from the 10th to the 90th percentile was constructed, faceted by explanatory variable (employment_status, number_children, wfh, age). To obtain the means and percentiles, the cups answers were recoded as numbers from 0 (“Less than 1”) to 5 (“More than 4”) and summarized within each group.
The point plot was chosen because it best displays the similarities and differences between the group averages. The percentiles also suggested that certain groups may be more likely to fall above or below the average, so one final plot was constructed to analyze these further.
A diverging bar chart, grouped by explanatory variable and filled by the answer to the “cups” variable, was then constructed. Only the answers outside of the average range (“Less than 1”, “4”, and “More than 4”) were plotted so we could focus on which groups are most likely to drink less or more than the average.
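The mean and percentile calculation can be sketched as follows. The data in `toy` are made up, and the numeric coding in `cup_codes` (0 for "Less than 1", 5 for "More than 4") is an assumption needed to average ordered categories.

```r
library(dplyr)

# Hypothetical responses for two age groups.
toy <- tibble(
  age = c("18-24 years old", "18-24 years old", "18-24 years old",
          ">65 years old", ">65 years old", ">65 years old"),
  cups = c("Less than 1", "1", "2", "2", "3", "More than 4")
)

# Assumed numeric coding of the answer options.
cup_codes <- c("Less than 1" = 0, "1" = 1, "2" = 2,
               "3" = 3, "4" = 4, "More than 4" = 5)

toy_stats <- toy %>%
  mutate(cups_num = unname(cup_codes[cups])) %>%   # look up the code for each answer
  group_by(age) %>%
  summarise(
    mean = round(mean(cups_num), 2),
    low  = quantile(cups_num, 0.10),               # 10th percentile
    high = quantile(cups_num, 0.90),               # 90th percentile
    .groups = "drop"
  )

toy_stats
```

Because "More than 4" is capped at 5 under this coding, group means for heavy drinkers are likely to be underestimates.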
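One way to prepare data for a diverging bar chart is to compute each answer's share within its group and then flip the sign of the "below average" answers so they plot to the left of zero. The sketch below uses hypothetical counts in `toy_counts`; it illustrates the idea and is not a copy of the project's plotting code.

```r
library(dplyr)

# Hypothetical counts of answers within two groups.
toy_counts <- tibble(
  group = c(rep("None", 4), rep("More than 3", 4)),
  cups  = rep(c("Less than 1", "2", "4", "More than 4"), times = 2),
  count = c(30, 50, 15, 5, 10, 50, 25, 15)
)

toy_diverging <- toy_counts %>%
  group_by(group) %>%
  mutate(share = count / sum(count)) %>%   # share of each answer within its group
  ungroup() %>%
  # keep only the answers outside the average range
  filter(cups %in% c("Less than 1", "4", "More than 4")) %>%
  # negative values extend left of zero, positive values extend right
  mutate(signed_share = if_else(cups == "Less than 1", -share, share))

toy_diverging
```

Shares are computed before filtering so that each percentage still refers to all responses in the group, not only the plotted ones.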
Analysis
Discussion
Across the board, coffee drinkers typically average one to three cups of coffee per day, with most groups averaging around 2 cups, as shown in Plot 1 and Plot 2. However, a trend emerges, especially with number of children and age: the average increases from 1 to 3 cups with increasing age and increasing number of children. The average is also higher for those who are retired or work full-time than for those who work part-time or are students, homemakers, or unemployed. There was no notable difference between those who work from home and those who work in person.
This trend also appears when looking only at the percentages for the answers outside of the average, as shown in Plot 3. A higher percentage of those younger than 18 years old and of those without children drink less than 1 cup of coffee a day. Drinking 4 or more cups of coffee a day is most common among those older than 65 years old and those with 3 or more children. Similarly, retired individuals reported drinking 4 or more cups of coffee a day at a higher rate than any other employment group, and unemployed respondents had the highest percentage drinking less than 1 cup a day.
These trends are not necessarily surprising: those working a full-time job or raising more children may require more caffeine than those with no children who work only part-time. Similarly, older individuals may want more coffee, or may have developed an affinity for it and enjoy multiple cups more for the taste than for the caffeine.
While these trends suggest some correlation, none of the explanatory variables can be assumed to cause how many cups of coffee someone drinks per day. Future work should further break down these groups, and especially consider the weight of age and the correlations among the explanatory variables. It is important to keep in mind that the survey respondents are customers of Cometeer and are not necessarily representative of the entire US or coffee-drinking population.
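A first step toward checking how the explanatory variables overlap could be a simple cross-tabulation. The sketch below uses base R on a hypothetical `toy` data frame; it only shows the mechanics and makes no claim about the actual survey.

```r
# Hypothetical respondents with two explanatory variables.
toy <- data.frame(
  age = c("18-24 years old", "18-24 years old",
          ">65 years old", ">65 years old", ">65 years old"),
  employment_status = c("Student", "Employed full-time",
                        "Retired", "Retired", "Employed full-time")
)

# Counts of each age group by employment status.
age_by_job <- table(toy$age, toy$employment_status)

# Row shares: within each age group, the fraction in each employment status.
row_shares <- prop.table(age_by_job, margin = 1)

round(row_shares, 2)
```

If most older respondents turn out to be retired, the age and retirement trends would be hard to separate without modelling both together.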
References
Source Code
---
title: "Coffee Drinking Habits and Working Lifestyle"
subtitle: "INFO 526 - Summer 2024 - Final Project"
author:
  - name: "Vera Jackson"
    affiliations:
      - name: "School of Information, University of Arizona"
description: "A data-driven analysis of how much coffee people drink and which lifestyle conditions correlate with it."
format:
  html:
    code-tools: true
    code-overflow: wrap
    embed-resources: true
    code-fold: true
    code-summary: "Show the code"
editor: visual
execute:
  warning: false
  echo: false
bibliography: references.bib
---

```{r}
#| label: load-packages
#| include: false

# Load packages here
pacman::p_load(dplyr, tidyverse, glue, scales, here, ggthemes, janitor, ggplot2, readr, ggrepel)
```

```{r}
#| label: setup
#| include: false

# Plot theme
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 11))

# For better figure resolution
knitr::opts_chunk$set(
  fig.retina = 3,
  dpi = 300,
  fig.width = 6,
  fig.asp = 0.618
)
```

```{r}
#| label: load data and cleanup
#| include: false

# load data here
coffee_survey <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-05-14/coffee_survey.csv')

# glimpse at data (4,042 rows)
coffee_survey

# clean and select data: keep only respondents who answered every selected question
coffee_survey_cleanNA <- coffee_survey %>%
  select(cups, employment_status, number_children, wfh, age) %>%
  filter(!is.na(cups)) %>%
  filter(!is.na(employment_status)) %>%
  filter(!is.na(wfh)) %>%
  filter(!is.na(number_children)) %>%
  filter(!is.na(age))

coffee_survey_cleanNA
```

```{r, echo=FALSE}
#| label: code for plot 1
#| include: false

############### cleaning for general and plot 1
# selecting relevant variables for both plots (3,343 rows)
coffee_survey_count_cups <- coffee_survey_cleanNA %>%
  count(cups) %>% # number of responses for each cups answer
  rename(total_cups = n)

manual_order <- c("Less than 1", "1", "2", "3", "4", "More than 4")

# Reorder the dataset according to the manual order
coffee_survey_count_cups <- coffee_survey_count_cups[match(manual_order, coffee_survey_count_cups$cups), ]
coffee_survey_count_cups

# calculating percentages
coffee_survey_count_cups <- coffee_survey_count_cups %>%
  mutate(percent = total_cups / sum(total_cups))

# format percentages as labels (e.g. "12.3%")
coffee_survey_count_cups <- coffee_survey_count_cups %>%
  mutate(percent = scales::percent(percent, accuracy = 0.1))
coffee_survey_count_cups

# mutate to manually position labels on pie chart
coffee_survey_count_plot1 <- coffee_survey_count_cups %>%
  mutate(
    csum = rev(cumsum(rev(total_cups))),
    pos = total_cups / 2 + lead(csum, 1),
    pos = if_else(is.na(pos), total_cups / 2, pos)
  )
```

```{r, echo=FALSE}
#| label: code for plot 2
#| include: false

# recode cups answers as numbers stored as text ("Less than 1" = 0, "More than 4" = 5)
coffee_survey_filter1 <- coffee_survey %>%
  select(submission_id, cups, employment_status, number_children, wfh, age) %>%
  filter(!is.na(cups)) %>%
  mutate(
    cups = recode(cups,
      "Less than 1" = "0",
      "1" = "1",
      "2" = "2",
      "3" = "3",
      "4" = "4",
      "More than 4" = "5"
    )
  )

# one row per respondent and explanatory variable
coffee_survey_filter1_longer <- coffee_survey_filter1 %>%
  pivot_longer(
    cols = c("employment_status", "number_children", "age", "wfh"),
    names_to = "explanatory",
    values_to = "explanatory_value"
  ) %>%
  filter(!is.na(explanatory_value)) %>%
  pivot_longer(
    cols = c("cups"),
    names_to = "response",
    values_to = "response_value"
  )

# now i will group the data and calculate summary statistics
coffee_survey_filter1_longer$response_value <- as.numeric(coffee_survey_filter1_longer$response_value)

coffee_survey_stats_all <- coffee_survey_filter1_longer %>%
  group_by(response) %>%
  summarise(
    mean = mean(response_value),
    low = quantile(response_value, 0.10),
    high = quantile(response_value, 0.90)
  ) %>%
  mutate(mean = round(mean, 2)) # mean output was returning values with 4 decimal places

coffee_survey_stats_all$explanatory <- c("All")
coffee_survey_stats_all$explanatory_value <- c("")

# now by group
coffee_survey_stats_by_group <- coffee_survey_filter1_longer %>%
  filter(!is.na(response_value)) %>%
  group_by(explanatory, explanatory_value, response) %>%
  summarise(
    mean = mean(response_value),
    low = quantile(response_value, 0.10),
    high = quantile(response_value, 0.90)
  ) %>%
  mutate(mean = round(mean, 2)) # mean output was returning values with 4 decimal places

# now to bind together both groups
coffee_survey_stats <- bind_rows(
  coffee_survey_stats_all,
  coffee_survey_stats_by_group
)

# new label for explanatory variable facet
exp.labs <- c(
  "All",
  "Age",
  "Number of \nChildren",
  "Employment \nStatus",
  "In-Person or \nVirtual Work"
)
names(exp.labs) <- c(
  "All",
  "age",
  "number_children",
  "employment_status",
  "wfh"
)

# manually order explanatory for facet
coffee_survey_stats$explanatory <- factor(
  coffee_survey_stats$explanatory,
  levels = c("All", "age", "number_children", "employment_status", "wfh"),
  ordered = TRUE
)
```

```{r, echo = FALSE}
#| label: code for plot 3
#| include: false

survey_1 <- coffee_survey_filter1_longer %>%
  # editing response values so number equals actual response
  mutate(response_value = case_when(
    response_value == "0" ~ "Less than 1",
    response_value == "1" ~ "1",
    response_value == "2" ~ "2",
    response_value == "3" ~ "3",
    response_value == "4" ~ "4",
    response_value == "5" ~ "More than 4"
  ))

coffee_survey_percentage <- survey_1 %>%
  # calculate sums and percentages for diverging plot
  filter(!is.na(response_value)) %>%
  group_by(explanatory, explanatory_value, response, response_value) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(explanatory_value) %>%
  mutate(percent_answers = (count / sum(count))) %>%
  ungroup() %>%
  mutate(percent_answers_label = percent(percent_answers, accuracy = 1)) %>%
  mutate(percent_answers = if_else(response_value %in% c("4"), percent_answers / 2, percent_answers)) # method to make diverging bar plot with neutral in the middle

coffee_survey_percentage <- coffee_survey_percentage %>%
  mutate(explanatory_value = fct_relevel(
    explanatory_value,
    "<18 years old", "18-24 years old", "25-34 years old", "35-44 years old",
    "45-54 years old", "55-64 years old", ">65 years old",
    "None", "1", "2", "3", "More than 3",
    "Unemployed", "Student", "Homemaker", "Employed part-time",
    "Employed full-time", "Retired",
    "I primarily work in person", "I primarily work from home", "I do a mix of both"
  )) %>%
  mutate(explanatory = fct_relevel(explanatory, "age", "number_children", "employment_status", "wfh"))

# manually order response value for facet
coffee_survey_percentage$response_value <- factor(
  coffee_survey_percentage$response_value,
  levels = c("Less than 1", "1", "2", "3", "4", "More than 4"),
  ordered = TRUE
)
```

## Abstract

Add project abstract here.

## Introduction

Coffee is one of the most popular drinks in the world. In the USA alone, 75% of the population has reported drinking coffee, and almost half of Americans drink it daily (Loftfield et al. 2016). People drink coffee for many reasons: for some it is a habit or part of their morning routine, some need the caffeine, and some just like the way it tastes. Studies have shown that demographic factors, such as gender and race, influence how much coffee someone drinks (Loftfield et al. 2016). However, one's lifestyle and environment may also play a part.

The "[Great American Coffee Taste Test](https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-05-14/readme.md)" data set comes from [`TidyTuesday`'s 2024 series](https://github.com/rfordatascience/tidytuesday/tree/master/data/2024#readme). It compiles survey results submitted by participants of a taste test hosted by "world champion barista" James Hoffmann and the coffee company [Cometeer](https://cometeer.com/pages/the-great-american-coffee-taste-test). Cometeer sent 4 unlabelled coffees to over 4,000 customers, who took part in a live taste test on YouTube while filling out the survey. The survey includes questions about coffee drinking habits, coffee preferences, individual taste test results for each of the 4 provided coffees, and individual demographics.

## Question: What are the correlations between coffee consumption and lifestyle?

### Introduction

The data set covers a wide range of questions and contains 4,042 responses. Rather than looking at demographics such as gender and race, I was curious about how someone's lifestyle, primarily their working habits, relates to how much coffee they drink. For this question, we will only look at the following variables, listed with their survey question and answer options:

- `cups`: "How many cups of coffee do you typically drink per day?"
    - "`Less than 1`", "`1`", "`2`", "`3`", "`4`", "`More than 4`"
- `employment_status`: "Employment Status"
    - "`Retired`", "`Employed full-time`", "`Employed part-time`", "`Homemaker`", "`Student`", "`Unemployed`"
- `number_children`: "Number of Children"
    - "`None`", "`1`", "`2`", "`3`", "`More than 3`"
- `wfh`: "Do you work from home or in person?"
    - `"I primarily work in person"`, `"I primarily work from home"`, `"I do a mix of both"`
- `age`: "What is your age?"
    - `"<18 years old"`, `"18-24 years old"`, `"25-34 years old"`, `"35-44 years old"`, `"45-54 years old"`, `"55-64 years old"`, `">65 years old"`

After removing respondents who did not answer every one of these questions, we are working with 3,343 responses.

I chose to analyze this question for this data set because I was curious about what may influence coffee drinkers' consumption habits, in particular their working conditions. Whether someone is retired or a student, or works from home or at the office, may shape their environment and influence their habits. In addition to employment and work-from-home status, I included the number of children, which is especially relevant for homemakers, since raising children is also a form of labor. Finally, I included age, as this could be another explanatory variable for how much coffee someone drinks.

Overall, my goal for this question is to highlight a trend between coffee drinking and lifestyle that could be reflective of the general American coffee-drinking population.

### Approach

Using the `cups` variable, a pie chart was made to show the distribution of responses across the entire cleaned data set. The pie chart is color-coded by the number of cups of coffee one drinks per day and labelled with the percentage of responses in each category.

Once the overall distribution was observed, a point plot with error bars ranging from the 10th to the 90th percentile was constructed, faceted by explanatory variable (`employment_status`, `number_children`, `wfh`, `age`). To obtain the means and percentiles, the `cups` answers were recoded as numbers from 0 ("Less than 1") to 5 ("More than 4") and summarized within each group.

The point plot was chosen because it best displays the similarities and differences between the group averages. The percentiles also suggested that certain groups may be more likely to fall above or below the average, so one final plot was constructed to analyze these further.

A diverging bar chart, grouped by explanatory variable and filled by the answer to the "`cups`" variable, was then constructed. Only the answers outside of the average range ("Less than 1", "4", and "More than 4") were plotted so we could focus on which groups are most likely to drink less or more than the average.

### Analysis

```{r, warning=FALSE, fig.width=5.5, fig.align="center"}
#| label: plot 1
#| code-fold: true
#| code-summary: "Show the code"

coffee_survey_count_plot1 %>%
  mutate(cups = fct_relevel(cups, "Less than 1", "1", "2", "3", "4", "More than 4")) %>%
  ggplot(aes(x = "", y = total_cups, fill = cups)) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_polar(theta = "y") +
  scale_x_discrete(NULL, expand = c(0, 0)) +
  scale_y_continuous(NULL, expand = c(0, 0)) +
  scale_fill_manual(
    values = c("#87A5A5FF", "#BCAAA4FF", "#A1887FFF", "#D2A54BFF", "#D2D2C3FF", "#A5A587FF"),
    name = "Number of Cups"
  ) +
  geom_label_repel(
    mapping = aes(y = pos, label = paste(percent)),
    size = 3,
    nudge_x = 0.9,
    show.legend = FALSE
  ) +
  labs(
    title = "Plot 1. Distribution of Daily Cups of Coffee \nAmong Responses \n ",
    caption = "Source: 'The Great American Coffee Taste Test' Data Set \n Retrieved from TidyTuesday 2024"
  ) +
  theme_void() +
  theme(
    axis.text = element_blank(),
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
    plot.title.position = "plot",
    legend.text = element_text(size = 8), # text of legend
    legend.key.size = unit(0.5, "cm")
  )
```

```{r, warning=FALSE, fig.width=5.5, fig.align="center"}
#| label: plot 2
#| code-fold: true
#| code-summary: "Show the code"

coffee_survey_stats %>%
  mutate(explanatory_value = fct_relevel(
    explanatory_value,
    "",
    "<18 years old", "18-24 years old", "25-34 years old", "35-44 years old",
    "45-54 years old", "55-64 years old", ">65 years old",
    "None", "1", "2", "3", "More than 3",
    "Unemployed", "Student", "Homemaker", "Employed part-time",
    "Employed full-time", "Retired",
    "I primarily work in person", "I primarily work from home", "I do a mix of both"
  )) %>%
  ggplot(aes(x = mean, y = explanatory_value)) +
  geom_point() +
  geom_errorbar(aes(xmin = low, xmax = high), width = 0.2) +
  facet_grid(explanatory ~ .,
    scales = "free", space = "free",
    labeller = labeller(explanatory = exp.labs)
  ) +
  scale_x_continuous(
    limits = c(0, 5),
    breaks = c(0, 1, 2, 3, 4, 5),
    labels = label_wrap(10)(c("Less than 1", "1", "2", "3", "4", "More than 4"))
  ) +
  labs(
    x = "Number of Cups \n(Error bars range from 10th to 90th percentile)",
    y = NULL,
    title = "Plot 2. Distribution of Daily Cups of Coffee \nAmong Explanatory Variables",
    caption = "Source: 'The Great American Coffee Taste Test' Data Set \n Retrieved from TidyTuesday 2024"
  ) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(), # remove lines in plot
    panel.spacing = unit(0.1, "cm"), # spacing between facets
    strip.text = element_text(colour = "black"),
    strip.background = element_rect( # creating blocks for facet labels to match original
      fill = "grey90", color = "grey20", linewidth = 1
    ),
    strip.text.y = element_text(angle = 0) # turn explanatory facets from vertical to horizontal
  )
```

```{r, warning=FALSE, fig.width=5.5, fig.align="center"}
#| label: plot 3
#| code-fold: true
#| code-summary: "Show the code"

coffee_survey_percentage %>%
  ggplot(aes(x = explanatory_value, y = percent_answers, fill = response_value)) +
  # answers below the average range extend to the left of zero
  geom_col(
    data = filter(coffee_survey_percentage, response_value %in% c("Less than 1")),
    aes(y = -percent_answers)
  ) +
  # answers above the average range extend to the right of zero
  geom_col(
    data = filter(coffee_survey_percentage, response_value %in% c("4", "More than 4")),
    aes(y = percent_answers)
  ) +
  scale_fill_manual(
    breaks = c("Less than 1", "4", "More than 4"),
    values = c("#87A5A5FF", "#D2D2C3FF", "#A5A587FF")
  ) +
  facet_grid(explanatory ~ .,
    scales = "free", space = "free",
    labeller = labeller(explanatory = exp.labs)
  ) +
  labs(
    title = "Plot 3. Responses Outside of Average",
    x = NULL,
    y = "Percent of Responses",
    fill = "Number of Cups",
    caption = "Source: 'The Great American Coffee Taste Test' Data Set \n Retrieved from TidyTuesday 2024"
  ) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  theme(
    title = element_text(face = "bold"),
    panel.grid.major.y = element_blank(),
    legend.text.position = "top",
    legend.title = element_text(hjust = 0.5, face = "bold"),
    legend.background = element_rect(fill = "grey90", colour = "grey20", linewidth = 0.5),
    axis.title = element_text(face = "bold"),
    strip.text = element_text(size = 7),
    strip.background = element_rect( # creating blocks for facet labels to match original
      fill = "grey90", color = "grey20", linewidth = 1
    ),
    strip.text.y = element_text(angle = 0) # turn explanatory facets from vertical to horizontal
  )
```

### Discussion

Across the board, coffee drinkers typically average one to three cups of coffee per day, with most groups averaging around 2 cups, as shown in Plot 1 and Plot 2. However, a trend emerges, especially with number of children and age: the average increases from 1 to 3 cups with increasing age and increasing number of children. The average is also higher for those who are retired or work full-time than for those who work part-time or are students, homemakers, or unemployed. There was no notable difference between those who work from home and those who work in person.

This trend also appears when looking only at the percentages for the answers outside of the average, as shown in Plot 3. A higher percentage of those younger than 18 years old and of those without children drink less than 1 cup of coffee a day. Drinking 4 or more cups of coffee a day is most common among those older than 65 years old and those with 3 or more children. Similarly, retired individuals reported drinking 4 or more cups of coffee a day at a higher rate than any other employment group, and unemployed respondents had the highest percentage drinking less than 1 cup a day.

These trends are not necessarily surprising: those working a full-time job or raising more children may require more caffeine than those with no children who work only part-time. Similarly, older individuals may want more coffee, or may have developed an affinity for it and enjoy multiple cups more for the taste than for the caffeine.

While these trends suggest some correlation, none of the explanatory variables can be assumed to cause how many cups of coffee someone drinks per day. Future work should further break down these groups, and especially consider the weight of age and the correlations among the explanatory variables. It is important to keep in mind that the survey respondents are customers of Cometeer and are not necessarily representative of the entire US or coffee-drinking population.

## References